Fidelity Enhancement for Deep Learning-based TTS using a Generative Adversarial Network and Data Augmentation

Jin Choi (최진); Jinhyeok Yang (양진혁); Injung Kim (김인중)

KIISE Transactions on Computing Practices (정보과학회 컴퓨팅의 실제 논문지)

Korean Title	생성적 적대 신경망과 데이터 확장을 이용한 딥러닝 기반 TTS 음질 개선
English Title	Fidelity Enhancement for Deep Learning-based TTS using a Generative Adversarial Network and Data Augmentation
Author	Jin Choi (최진), Jinhyeok Yang (양진혁), Injung Kim (김인중)
Citation	Vol. 26, No. 5, pp. 256-260 (May 2020)
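Korean Abstract (translated)	This paper introduces TE-GAN (TTS Enhancement GAN), a deep learning model that uses a generative adversarial network to refine the Mel-spectrogram synthesized by a deep learning-based TTS model so that it resembles the Mel-spectrogram of real speech. TE-GAN is designed with the characteristics of speech signals in mind and improves audio quality markedly even when combined with a simple vocoder such as the Griffin-Lim algorithm. In addition, we propose a data augmentation method based on temporal multi-agents (TMA) for effective training of TE-GAN. Experiments show that the proposed methods can greatly improve the quality of speech synthesized by a TTS system. In the experiments, TE-GAN made the Mel-spectrum synthesized by Tacotron more similar to that of real speech, and the MOS of the synthesized speech improved substantially from 2.07 to 3.24.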
English Abstract	In this paper, we introduce TE-GAN (TTS Enhancement GAN), a deep learning model that uses a generative adversarial network to enhance the Mel-spectrogram synthesized by a deep learning-based TTS model so that it resembles that of human speech. TE-GAN was designed with the characteristics of speech signals in mind and can significantly improve the fidelity of speech signals even when combined with a simple vocoder such as the Griffin-Lim algorithm. Additionally, we present a data augmentation technique based on a Temporal Multi-Agent (TMA) approach for effective training of TE-GAN. Experimental results demonstrate that the proposed methods significantly improve the fidelity of the speech signals synthesized by the TTS system. In experiments, TE-GAN made the Mel-spectrogram synthesized by Tacotron more similar to the Mel-spectrogram of human speech. In addition, the MOS of the synthesized speech improved significantly, from 2.07 to 3.24.
Keywords	deep learning; speech synthesis; generative adversarial network; data augmentation; TTS fidelity enhancement